14 research outputs found

    A simulation framework for rapid prototyping and evaluation of thermal mitigation techniques in many-core architectures

    Modern SoCs are characterized by increasing power density and consequently increasing temperature, which directly impacts the performance, reliability and cost of a device through its packaging. Thermal issues need to be predicted and mitigated as early as possible in the design flow, when the optimization opportunities are the highest. In this paper, we present an efficient framework for the design of dynamic thermal mitigation schemes based on a high-level SystemC virtual prototype tightly coupled with efficient power and thermal simulation tools. We demonstrate the benefit of our approach through silicon comparison with the SThorm 64-core architecture and provide simulation speed results, making it a sound solution for the design of thermal mitigation early in the flow.

    Multilevel simulation-based co-design of next generation HPC microprocessors

    This paper demonstrates the combined use of three simulation tools in support of a co-design methodology for an HPC-focused System-on-a-Chip (SoC) design. The simulation tools make different trade-offs between simulation speed, accuracy and model abstraction level, and are shown to be complementary. We apply the MUSA trace-based simulator for the initial sizing of vector register length, system-level cache (SLC) size and memory bandwidth. It has proven to be very efficient at pruning the design space, as its models enable sufficient accuracy without having to resort to highly detailed simulations. Then we apply gem5, a cycle-accurate micro-architecture simulator, for a more refined analysis of the performance potential of our reference SoC architecture, with models able to capture detailed hardware behavior at the cost of simulation speed. Furthermore, we study the network-on-chip (NoC) topology and IP placements using both gem5 for representative small- to medium-scale configurations and SESAM/VPSim, a transaction-level emulator for larger scale systems with good simulation speed and sufficient architectural details. Overall, we consider several system design concerns, such as processor subsystem sizing and NoC settings. We apply the selected simulation tools, focusing on different levels of abstraction, to study several configurations with various design concerns and evaluate them to guide architectural design and optimization decisions. Performance analysis is carried out with a number of representative benchmarks. The obtained numerical results provide guidance and hints to designers regarding SIMD instruction width, SLC sizing, memory bandwidth as well as the best placement of memory controllers and NoC form factor. 
Thus, we provide critical insights for efficient design of future HPC microprocessors. This work has been performed in the context of the European Processor Initiative (EPI) project, which has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement № 826647. A special thanks to Amir Charif and Arief Wicaksana for their invaluable contributions to the SESAM/VPSim tool in the initial phases of the EPI project. Peer Reviewed. Postprint (author's final draft)

    Standard-compliant Parallel SystemC simulation of Loosely-Timed Transaction Level Models

    To face the growing complexity of System-on-Chips (SoCs) and their tight time-to-market constraints, Virtual Prototyping (VP) tools based on SystemC/TLM must get faster while keeping accuracy. However, the Accellera SystemC reference implementation remains sequential and cannot leverage the multiple cores of modern workstations. In this paper, we present a new implementation of a parallel and standard-compliant SystemC kernel, reaching unprecedented performance. By coupling a parallel SystemC kernel and memory access monitoring, we are able to keep SystemC atomic thread evaluation while leveraging the available host cores. Evaluations show a ×19 speed-up compared to the Accellera SystemC kernel using 33 host cores, reaching speeds above 2000 Million simulated Instructions Per Second (MIPS).

    Standard-compliant parallel SystemC simulation of loosely-timed transaction level models: From baremetal to Linux-based applications support

    To face the growing complexity of System-on-Chips (SoCs) and their tight time-to-market constraints, Virtual Prototyping (VP) tools based on SystemC/TLM2.0 must get faster while maintaining accuracy. However, the ASI SystemC reference implementation remains sequential and cannot leverage the multiple cores of modern workstations. In this paper, we present SCale 2.0, a new implementation of a parallel and standard-compliant SystemC kernel, reaching unprecedented simulation speeds. By coupling a parallel SystemC kernel with shared-resource access monitoring and process-level rollback, we can preserve SystemC atomic thread evaluation while leveraging the available host cores. We also generate process interaction traces that can be used to replay any simulation deterministically for debugging purposes. Evaluation on baremetal applications shows a ×15 speedup compared to the ASI SystemC kernel using 33 host cores, reaching speeds above 2300 Million simulated Instructions Per Second (MIPS). Challenges related to parallel simulation of a full software stack with modern operating systems are also addressed, with speedup reaching ×13 during the recording run and ×24 during the replay run.

    Fast Virtual Prototyping for Embedded Computing Systems Design and Exploration

    Virtual Prototyping has been widely adopted as a cost-effective solution for early hardware and software co-validation. However, as systems grow in complexity and scale, both the time required to get to a correct virtual prototype and the time required to run real software on it can quickly become unmanageable. This paper introduces a feature-rich integrated virtual prototyping solution, designed to meet industrial needs not only in terms of performance, but also in terms of ease, rapidity and automation of modelling and exploration. It introduces novel methods to leverage the QEMU dynamic binary translator and the abstraction levels offered by SystemC/TLM 2.0 to provide the best possible trade-offs between accuracy and performance at all steps of the design. The solution also ships with a dynamic platform composition infrastructure that makes it possible to model and explore a myriad of architectures using a compact high-level description. Results obtained simulating a RISC-V SMP architecture running the PARSEC benchmark suite reveal that simulation speed can range from 30 MIPS in accurate simulation mode to 220 MIPS in fast functional validation mode.

    Leveraging distributed GraphLab for program trace analysis

    Graph-mining is a class of data-mining problems where programs involve the processing of data modeled as graphs. These applications often exhibit irregular and data-dependent communication patterns, hampering parallelization opportunities on distributed architectures. Many tools and frameworks were created for the scalable processing of graphs, but their comparison is non-trivial on distributed architectures as there are no efficiency metrics with respect to distributed resource usage. Considering an in-house use-case, program trace analysis for parallelization optimizations, we study the benefits and limits of a graph-processing framework for a tangible application. The algorithm was implemented using GraphLab and executed on a humble 7-node commodity cluster with input instances up to 40 million vertices and 50 million edges. We propose in this paper an in-depth analysis of the GraphLab system to evaluate its performance and scalability in the context of program trace analysis. The analysis is driven both by traditional and domain-specific metrics and contributes to a better understanding of the system behavior.

    Exploration of de Bruijn Graph Filtering for de novo Assembly Using GraphLab

    The emergence of next generation DNA sequencers has raised interest in short read de novo assembly of whole genomes. Though numerous frameworks were developed in the field, the presence of errors in reads as well as the increasing size of datasets call for scalable preprocessing methods for noise filtering. In this paper we present a filtering algorithm that targets determination of valid k-mers in a de Bruijn graph built from short reads. Such preprocessing will help increase accuracy and reduce memory footprint in further assembly procedures by removing erroneous k-mers from the datasets at an early stage. The algorithm leverages GraphLab, a scalable graph processing framework not previously used in traditional assembly toolchains. The accuracy of the algorithm was evaluated with synthetic datasets exhibiting various error rates and proven to be able to determine large parts of de Bruijn graphs on datasets with error level greater than real-life datasets. The implementation is executed on a distributed cluster, and a study of its scalability and operating performance is conducted and exhibits interesting scaling properties, hence demonstrating the relevance of GraphLab in such a context.